Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-3414][SQL] Replace LowerCaseSchema with Resolver #2382

Closed
wants to merge 5 commits into from

Conversation

marmbrus
Copy link
Contributor

This PR introduces a subtle change in semantics for HiveContext when using the results in Python or Scala. Specifically, while resolution remains case insensitive, it is now case preserving.

This PR is a follow up to #2293 (and to a lesser extent #2262 #2334).

In #2293 the catalog was changed to store analyzed logical plans instead of unresolved ones. While this change fixed the reported bug (which was caused by yet another instance of us forgetting to put in a LowerCaseSchema operator) it had the consequence of breaking assumptions made by MultiInstanceRelation. Specifically, we can't replace swap out leaf operators in a tree without rewriting changed expression ids (which happens when you self join the same RDD that has been registered as a temp table).

In this PR, I instead remove the need to insert LowerCaseSchema operators at all, by moving the concern of matching up identifiers completely into analysis. Doing so allows the test cases from both #2293 and #2262 to pass at the same time (and likely fixes a slew of other "unknown unknown" bugs).

While it is rolled back in this PR, storing the analyzed plan might actually be a good idea. For instance, it is kind of confusing if you register a temporary table, change the case sensitivity of resolution and now you can't query that table anymore. This can be addressed in a follow up PR.

Follow-ups:

  • Configurable case sensitivity
  • Consider storing analyzed plans for temp tables

@marmbrus
Copy link
Contributor Author

/cc @liancheng

@marmbrus
Copy link
Contributor Author

/cc @ericl

@SparkQA
Copy link

SparkQA commented Sep 13, 2014

QA tests have started for PR 2382 at commit c2f2ec8.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 13, 2014

QA tests have finished for PR 2382 at commit c2f2ec8.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 13, 2014

QA tests have started for PR 2382 at commit 5b93711.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 13, 2014

QA tests have finished for PR 2382 at commit 5b93711.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 13, 2014

QA tests have started for PR 2382 at commit 5b93711.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 13, 2014

QA tests have started for PR 2382 at commit 219805a.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 13, 2014

QA tests have finished for PR 2382 at commit 5b93711.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class JavaSparkContext(val sc: SparkContext)
    • class TaskCompletionListenerException(errorMessages: Seq[String]) extends Exception
    • class Dummy(object):
    • class JavaStreamingContext(val ssc: StreamingContext) extends Closeable

@SparkQA
Copy link

SparkQA commented Sep 13, 2014

QA tests have finished for PR 2382 at commit 219805a.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -22,4 +22,9 @@ package org.apache.spark.sql.catalyst
* Analysis consists of translating [[UnresolvedAttribute]]s and [[UnresolvedRelation]]s
* into fully typed objects using information in a schema [[Catalog]].
*/
package object analysis
package object analysis {
type Resolver = (String, String) => Boolean
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolver probably a general name, can we use a more precise name for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will actually end up providing more general resolution functionality in the long term. I've added some scala doc for clarity though.

@liancheng
Copy link
Contributor

LGTM except some minor issues mentioned in the comments :)

@liancheng
Copy link
Contributor

Oh, one more thing, please help rename this test case:

test("SPARK-3414 regression: should store analyzed logical plan when registering a temp table") {

@SparkQA
Copy link

SparkQA commented Sep 18, 2014

QA tests have started for PR 2382 at commit 01cc29a.

  • This patch merges cleanly.

…sensitive resolution is still case preserving.
@SparkQA
Copy link

SparkQA commented Sep 18, 2014

QA tests have started for PR 2382 at commit c21171e.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 18, 2014

QA tests have finished for PR 2382 at commit c21171e.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging

@SparkQA
Copy link

SparkQA commented Sep 18, 2014

QA tests have finished for PR 2382 at commit 01cc29a.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging

@asfgit asfgit closed this in 293ce85 Sep 20, 2014
@marmbrus marmbrus deleted the lowercase branch September 22, 2014 19:53
(nestedFields, expression.dataType) match {
case (Nil, _) => expression
case (requestedField :: rest, StructType(fields)) =>
val actualField = fields.filter(f => resolver(f.name, requestedField))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a problem here. Currently a.b[1].c.d will be parsed as GetField(GetField(GetItem(Unresolved("a.b"), 1), "c"), "d") , so the case-sensitive-check only happens when resolve Unresolved("a.b") to GetField(Attribute("a"), "b"). Something like "SELECT a[0].A.A from nested" will fail for hql on case-sensitive-check.
I think we should also do this check in GetField.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, good point. Right now you can't make a SQLContext case insensitive, but when you can this will be problem. Maybe you should note this on SPARK-3617

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wait, sorry... Is that how the HiveQL parser will do it too? I'm not a huge fan of moving resolution logic into the expressions. What about a rule that only ran in case insensitive mode that fixes unresolved GetFields?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this bug exists in HiveQL. I have opened a PR to fix this(adding a rule to fix unresolved GetFields).#2543 Need your comments :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants